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Abstract 

14 -minimization  refers  to  finding  the  minimum  4-norm  solution  to  an  underdetermined 
linear  system  b  =  Ax.  It  has  recently  received  much  attention,  mainly  motivated  by 
the  new  compressive  sensing  theory  that  shows  that  under  quite  general  conditions 
the  minimum  14-norm  solution  is  also  the  sparsest  solution  to  the  system  of  linear 
equations.  Although  the  underlying  problem  is  a  linear  program,  conventional  algo¬ 
rithms  such  as  interior-point  methods  suffer  from  poor  scalability  for  large-scale  real 
world  problems.  A  number  of  accelerated  algorithms  have  been  recently  proposed  that 
take  advantage  of  the  special  structure  of  the  f4-minimization  problem.  In  this  paper, 
we  provide  a  comprehensive  review  of  five  representative  approaches,  namely,  Gradi¬ 
ent  Projection ,  Homotopy,  Iterative  Shrinkage- Thresholding,  Proximal  Gradient,  and 
Augmented  Lagrange  Multiplier.  The  work  is  intended  to  fill  in  a  gap  in  the  existing 
literature  to  systematically  benchmark  the  performance  of  these  algorithms  using  a 
consistent  experimental  setting.  In  particular,  the  paper  will  focus  on  a  recently  pro¬ 
posed  face  recognition  algorithm,  where  a  sparse  representation  framework  has  been 
used  to  recover  human  identities  from  facial  images  that  may  be  affected  by  illumi¬ 
nation,  occlusion,  and  facial  disguise.  MATLAB  implementations  of  the  algorithms 
described  in  this  paper  have  been  made  publicly  available. 


1  Introduction 

Compressive  sensing  (CS)  has  been  one  of  the  hot  topics  in  the  signal  process¬ 
ing  and  optimization  communities  in  the  last  five  years  or  so.  In  CS  theory 
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[11,  19,  10,  13],  it  has  been  shown  that  the  minimum  i\ -norm  solution  to  an 
underdetermined  system  of  linear  equations  is  also  the  sparsest  possible  solu¬ 
tion  under  quite  general  conditions.  More  specifically,  suppose  there  exists  an 
unknown  signal  x0  £  Rn,  a  measurement  vector  b  £  R.d  (d  <  n),  and  a  mea¬ 
surement  matrix  A  £  Rdx™  such  that  A  is  full  rank  and  b  =  Ax0.  Recovering 
Xq  given  A  and  b  constitutes  a  non-trivial  linear  inversion  problem,  since  the 
number  of  measurements  in  b  is  smaller  than  the  number  of  unknowns  in  xq. 
A  conventional  solution  to  this  problem  is  the  Linear  Least  Squares  which  finds 
the  minimum  f^-norm  solution  (or  the  solution  of  least  energy)  to  this  system. 
Recently,  it  was  shown  that  if  x0  is  sufficiently  sparse1  and  the  sensing  matrix 
A  is  incoherent  with  the  basis  under  which  Xq  is  sparse  (i.e. ,  the  identity  ma¬ 
trix  in  the  standard  form),  then  Xq  can  be  exactly  recovered  by  computing  the 
minimum  p-norm  solution,  as  given  by  the  following  optimization  problem: 

(Pi)  :  min  ||a:||i  subj.  to  b  =  Ax.  (1) 

X 

We  refer  to  the  above  problem  as  f  | -minimization  or  fi-min  in  this  paper.  The 
sparsity-seeking  property  of  (Pi)  has  been  shown  to  have  applications  in  geo¬ 
physics,  data  compression,  image  processing,  sensor  networks,  and  more  re¬ 
cently,  in  computer  vision.  The  interested  reader  is  referred  to  [11,  2,  10,  48,  51] 
for  a  comprehensive  review  of  these  applications. 

We  note  that  (Pi)  can  be  recast  as  a  linear  program  (LP)  and  can  be  solved 
by  conventional  methods  such  as  interior-point  methods.  However,  the  com¬ 
putational  complexity  of  these  general-purpose  algorithms  is  often  too  high  for 
many  real-world,  large-scale  applications.  Alternatively,  heuristic  greedy  algo¬ 
rithms  have  been  developed  to  approximate  (Pl).  Orthogonal  matching  pursuit 
(OMP)  [17]  and  least  angle  regression  (LARS)  [22]  are  two  examples  in  this  cat¬ 
egory.  Empirically,  these  greedy  algorithms  work  better  when  Xq  is  very  sparse, 
but  will  deviate  from  the  solution  of  (Pi)  when  the  number  of  non-zero  entries 
in  Xo  increases,  as  illustrated  in  [42],  In  other  words,  the  greedy  methods  do 
not  come  with  strong  theoretical  guarantees  for  global  convergence. 

Besides  scalability,  another  important  requirement  for  real-world  applica¬ 
tions  is  robustness  to  noise,  namely,  the  observation  vector  b  may  be  corrupted 
by  data  noise.  To  take  into  account  of  the  noise,  one  can  relax  the  equality 
constraint  as  follows: 

(Pi,2)  :  min  ||a;||i  subj.  to  \\b  -  Ax\\2  <  e,  (2) 

X 

where  e  >  0  is  a  pre-determined  noise  level.  If  the  observation  vector  b  is 
assumed  to  be  corrupted  by  white  noise  of  magnitude  up  to  e,  then  the  ground- 
truth  signal  Xq  can  be  well  approximated  by  (Pii2),  dubbed  basis  pursuit  tie- 
noising  (BPDN)  in  [13,  12]. 

Based  on  the  context  of  the  data  noise,  the  £2-norm  used  in  the  penalty  term 
can  also  be  replaced  by  other  fp-nonns.  In  this  paper,  we  also  focus  on  a  special 

1  By  sparse,  we  mean  that  most  entries  in  the  vector  are  zero. 
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case  of  (Pi)  given  by: 

(Pi  i)  :  min  ||cc|| i  subj.  to  ||6—  Ax ||i  <  e.  (3) 

X 

This  formulation  has  been  used  in  [49,  47]  for  robust  face  recognition.  Although 
(Pl,i)  can  be  solved  by  any  algorithm  designed  for  (Pi),  as  we  will  later  explain, 
its  special  structure  can  be  further  exploited  to  develop  scalable  methods. 

In  light  of  the  high  interest  in  finding  more  efficient  algorithms  to  solve  these 
problems,  many  new  algorithms  have  been  proposed.  Although  it  is  impossible 
to  summarize  all  existing  algorithms  in  the  literature,  in  this  paper,  we  provide 
a  comprehensive  review  of  five  representative  methods,  namely,  Gradient  Pro¬ 
jection  (GP)  [23,  31],  Homotopy  [41,  34,  21],  Iterative  Shrinkage- Thresholding 
(1ST)  [16,  14,  26,  50],  Proximal  Gradient  (PG)  [38,  39,  5,  6],  and  Augmented 
Lagrange  Multiplier  (ALM)  [8,  52].  Although  the  paper  is  mainly  focused  on 
fast  implementations  of  G-min,  the  reader  may  refer  to  [10,  45]  for  a  broader  dis¬ 
cussion  about  recovering  sparse  solutions  via  other  approaches,  such  as  greedy 
pursuit- type  algorithms. 

This  paper  intends  to  fill  in  a  gap  in  the  existing  literature  to  systematically 
benchmark  the  performance  of  these  algorithms  using  a  fair  and  consistent  ex¬ 
perimental  setting.  Due  to  the  attention  given  to  compressive  sensing  and  G- 
minimization,  other  more  sophisticated  solutions  continue  to  be  developed  at  a 
rapid  pace.  More  recent  developments  include  subspace  pursuit  [15],  CoSaMP 
[37],  approximate  message-passing  algorithm  [20],  and  Bregman  iterative  algo¬ 
rithm  [53] ,  to  name  a  few.  However,  we  do  not  believe  that  there  exists  an  overall 
winner  that  could  achieve  the  best  performance  in  terms  of  both  speed  and  ac¬ 
curacy  for  all  applications.  Therefore,  in  addition  to  extensive  simulations  using 
synthetic  data,  our  experiments  will  be  focused  on  a  specific  application  of  ro¬ 
bust  face  recognition  proposed  in  [49] ,  where  a  sparse  representation  framework 
has  recently  been  developed  to  recognize  human  identities  from  facial  images. 
To  aid  peer  evaluation,  all  algorithms  discussed  in  this  paper  have  been  made 
available  on  our  website  as  a  MATLAB  toolbox: 

http : //www . eecs . berkeley . edu/~yang/sof tware/llbenchmark/ 

1.1  Notation 

For  a  vector  x  £  Rn,  we  denote  by  x+  and  x _  the  vectors  that  collect  the 
positive  and  negative  coefficients  of  x,  respectively: 

x  =  x+  —  x_ ,  x+  >  0,  x_  >0.  (4) 

We  also  denote 

X  =  diag(cci,  x2,  •  •  •  ,  xn)  €  R"xn  (5) 

as  a  square  matrix  with  the  coefficients  of  x  as  its  diagonal  and  zero  other¬ 
wise.  The  concatenation  of  two  (column)  vectors  will  be  written  following  the 
MATLAB  convention:  [xqaq]  =  [x\]\  [aq,^]  =  [*1  *2].  We  denote  by  1  a 
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vector  whose  components  are  all  one  with  dimension  defined  within  the  con¬ 
text.  We  represent  the  Euclidean  or  f  2-norm  by  ||  •  || 2  and  the  fi-norm  by  ||  •  ||  1 . 
The  notation  ||  •  ||  represents  the  ^2-norm  for  vectors  and  the  spectral  norm  for 
matrices. 

For  any  real-valued  differentiable  function  /(•),  we  denote  its  gradient  by 
V/(-).  If  a  real-valued  convex  function  g(-)  is  not  differentiable  everywhere,  we 
represent  its  subgradient  by  dg(-)  defined  as  follows: 

dg{x )  =  {y  £  :  g(x)  —  g(x)  >  yT{x  —  x),\/x  £  K"}.  (6) 

1.2  Primal-Dual  Interior-Point  Methods 

We  first  discuss  a  classical  solution  to  the  fn-min  problem  (Pi),  called  the  primal- 
dual  interior-point  method,  which  is  usually  attributed  to  the  works  of  [24,  29, 
35,  36,  32].  For  the  sake  of  simplicity,  we  assume  here  that  the  sparse  solution 
x  is  nonnegative.2  Under  this  assumption,  it  is  easy  to  see  that  {Pi)  can  be 
converted  to  the  standard  primal  and  dual  forms  in  linear  programming: 

Primal  (P) 

nrin^,  cTx 
subj.  to  Ax  =  b 
x  >  0 

where  for  U-niin,  c  =  1.  The  primal-dual  algorithm  simultaneously  solves  for 
the  primal  and  dual  optimal  solutions  [9]. 

It  was  proposed  in  [24]  that  (P)  can  be  converted  to  a  family  of  logarithmic 
barrier  problems3: 

(Pp)  :  minx  cTx  -  p  X)"=1  loSxi  /gt 

subj .  to  Ax  =  b,  x  >  0  ' 

Clearly,  a  feasible  solution  x  for  (PM)  cannot  have  zero  coefficients.  Therefore, 
we  define  the  interiors  of  the  solution  domains  for  (P)  and  ( D )  as: 

P++  =  {x  :  Ax  =  b,  x  >  0}, 

D++  =  {{y,z)  :  ATy  +  z  =  c,z  >  0},  (9) 

S++  =  P++  x  D++. 

Assuming  that  the  above  sets  are  non-empty,  it  can  be  shown  that  (P^) 
has  a  unique  global  optimal  solution  x{p)  for  all  p  >  0.  As  p  — >  0,  x{p)  and 
{y{p),  z{p ))  converge  to  optimal  solutions  of  problems  (P)  and  {D)  respectively 
[35,  36], 

The  primal-dual  interior-point  algorithm  seeks  the  domain  of  the  central 
trajectory  for  the  problems  (P)  and  {D)  in  S++,  where  the  central  trajectory  is 

2  This  constraint  can  be  easily  removed  by  considering  the  linear  system  b  = 
[A,  —  A][£C-)_;  cc — ] ,  where  [x-i;x—]  is  also  nonnegative. 

2  In  general,  any  smooth  function  'I'  that  satisfies  H/fCU)  =  —00  is  a  valid  barrier  function 

[27]. 


maxyiZ 
subj.  to 


Dual  (D) 
b1  y 

ATy  +  z  = 
z>  0, 


(7) 
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defined  as  the  set  S  =  {(x(y),  y{y),  z(y))  :  y  >  0}  of  solutions  to  the  following 
system  of  equations: 


XZ1  =  yl,  Ax  =  b,ATy  +  z  =  c, 

x  >  0,  z  >  0.  1  J 

The  above  condition  is  also  known  as  the  Karush-Kuhn-Tucker  (KKT)  station¬ 
ary  condition  for  the  convex  program  (PM)  [36,  32]. 

Hence,  the  update  rule  on  the  current  value  (x^k\y^k\z^)  is  defined  by 
the  Newton  direction  ( Ax ,  Ay,  A z),  which  is  computed  as  the  solution  to  the 
following  set  of  linear  equations: 

Z^Ax  +  X^Az  =  /il- A(fe)z(fe), 

AAx  =  0,  (11) 

ATAy  +  A  z  =  0, 

where  y  is  a  penalty  parameter  that  is  generally  different  from  y  in  (PM). 

In  addition  to  the  update  rule  (11),  an  algorithm  also  needs  to  specify  a 
stopping  criterion  when  the  solution  is  close  to  the  optimum.  For  fi-min,  some 
simple  rules  can  be  easily  evaluated: 

1.  The  relative  change  of  the  sparse  support  set  becomes  small; 

2.  The  relative  change  (in  the  sense  of  the  f^-norm)  of  the  update  of  the 
estimate  becomes  small; 

3.  The  relative  change  of  the  objective  function  becomes  small. 

A  more  detailed  discussion  about  choosing  good  stopping  criteria  in  different 
applications  is  postponed  to  Section  3. 

Algorithm  1  summarizes  a  conceptual  implementation  of  the  interior-point 
methods.4  For  more  details  about  how  to  choose  the  initial  values  ( x ,  y ,  z^ ) 
and  the  penalty  parameter  y,  the  reader  is  referred  to  [32,  36].  Algorithm  1  re¬ 
quires  a  total  of  0(y/n)  iterations,  and  each  iteration  can  be  executed  in  0(?i3) 
operations  for  solving  the  linear  system  (11). 

In  one  simulation  shown  in  Figure  1,  the  computational  complexity  of  Al¬ 
gorithm  1  with  respect  to  (w.r.t.)  the  sensing  dimension  d  grows  much  faster 
than  the  other  five  algorithms  in  comparison.  For  example,  at  d  =  1900,  the 
fastest  algorithm  in  this  simulation,  i.e.,  DALM,  only  takes  about  0.08  sec  to 
complete  one  trial,  which  is  more  than  250  times  faster  than  PDIPA.  For  this 
reason,  basic  BP  algorithms  should  be  used  with  caution  in  solving  real-world 
applications.  The  details  of  the  simulation  are  explained  later  in  Section  3. 

Next,  we  will  review  the  five  fast  fj-miii  algorithms  shown  in  Figure  1, 
namely,  Gradient  Projection  in  Section  2.1,  Homotopy  in  Section  2.2,  Iterative 

4  There  are  multiple  versions  of  the  primal-dual  interior-point  solver  implemented  in  MAT- 
LAB.  Notable  examples  include  SparseLab  at  http://sparselab.stanford.edu/,  the  cvx  en¬ 
vironment  at  http://cvxr.com/cvx/,  and  the  timagic  package  at  http : //www . acm.  caltech. 
edu/llmagic/. 
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Algorithm  1  Primal-Dual  Interior-Point  Algorithm  (PDIPA) 

Input:  A  full  rank  matrix  A  £  Rdxn,  d  <  n,  a  vector  b  £  initial  guess 
(x(°\  y(°\  z^0)).  Iteration  k  ■£-  0.  Initial  penalty  p  and  a  decreasing  factor 
0  <  8  <  y/n. 

1:  repeat 

2:  k  i —  k  T 1,  p  i  ^r(l  (5/ v^)- 
3:  Solve  (11)  for  (Ate,  Ay,  Az). 

4:  X t—  x(k~1'>  +  Ax ,  y(fe)  «—  -I-  Ay,  2^  •<—  +  Az. 

5:  until  stopping  criterion  is  satisfied. 

Output:  x*  £-  x(k\ 


Projection  Dimension 


2000 


<D 
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— e—  Homotopy 
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Fig.  1:  Average  run  time  of  PDIPA  in  comparison  with  five  fast  implementations 
under  similar  estimation  accuracy.  The  simulation  setup:  n  =  2000, 
k  =  200.  The  projection  matrices  are  randomly  generated  based  on  the 
standard  normal  distribution  with  the  dimension  varies  from  300  to  1900. 
The  support  of  the  ground  truth  a:o  is  randomly  selected  at  each  trial, 
and  the  nonzero  coefficients  are  sampled  from  the  normal  distribution. 


Shrinkage- Thresholding  in  Section  2.3,  Proximal  Gradient  in  Section  2.4,  and 
Augmented  Lagrange  Multiplier  in  Section  2.5.  The  algorithms  of  L1LS,  Ho¬ 
motopy,  and  SpaRSA  are  provided  by  their  respective  authors.  We  have  also 
provided  our  implementation  of  FISTA  and  DALM  on  our  website. 
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2  Fast  £i-Min  Algorithms 

2.1  Gradient  Projection  Methods 

We  first  discuss  Gradient  Projection  (GP)  methods  that  seek  a  sparse  repre¬ 
sentation  x  along  a  certain  gradient  direction,  which  induces  much  faster  con¬ 
vergence  speed.  The  approach  reformulates  the  f^-min  problem  as  a  quadratic 
programming  (QP)  problem. 

We  start  with  the  G-min  problem  (Pi, 2)-  It  is  equivalent  to  the  so-called 
LASSO  problem  [44]: 

{QP)  :  min  ||b  —  Ax\\%  subj.  to  ||sc||i  <  ct,  (12) 

X 

where  er  >  0  is  an  appropriately  chosen  constant.  Using  a  Lagrangian  for¬ 
mulation,  the  problem  (Pl, 2)  (and  hence,  ( QP ))  can  both  be  rewritten  as  an 
unconstrained  optimization  problem: 

x*  =  argminP(a:)  =  argmin  I||b  —  Aa;|||  +  A||ai||i,  (13) 

X  X  2 

where  A  is  the  Lagrangian  multiplier. 

In  the  literature,  there  exist  two  slightly  different  methods  that  formu¬ 
late  (13)  as  a  quadratic  programming  problem,  namely,  gradient  projection 
sparse  representation  (GPSR)  [23]  and  truncated  Newton  interior-point  method 
(TNIPM)  [31]. 5 

To  formulate  the  GPSR  algorithm,  we  can  separate  the  positive  coefficients 
x+  and  the  negative  coefficients  cc_  in  x,  and  rewrite  (13)  as 

min,,,  Q(x)  =  |||£»  —  [A,  -A][x+;  x_}\\l  +  \1T {x+  +  x_) 

subj.  to  x+  >  0,x_  >0.  ' 


Problem.  (14)  can  be  rewritten  in  the  standard  QP  form  as 

min  Q(z)  =  cTz  +  1 zT  Bz 
subj.  to  z  >  0, 


(15) 


where  z  =  [x+\  x_],  c  =  A1  +  [—  ATb;  A 2  b],  and 

_  [  ATA  -~ATA 
B  ~  -ATA  ATA 


(16) 


We  note  that  the  gradient  of  Q{z)  is  defined  as 

X7zQ{z)  =  c  +  Bz.  (17) 

5  A  MATLAB  implementation  of  GPSR  is  available  at  http :  / /www.  lx .  it.pt/~mtf/GPSR.  A 
MATLAB  Toolbox  for  TNIPM  called  L1LS  is  available  at  http://www.stanford.edu/~boyd/ 
ll.ls/. 
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This  leads  to  a  steepest-descent  algorithm  that  searches  from  each  iterate  z ^ 
along  the  negative  gradient  —  X7Q(z): 

z(k+i)  =  z(k)  _  a(fc)vQ(^W),  (18) 

where  a ^  is  the  step  size.  This  can  be  solved  by  a  standard  line-search  process 
[28].  For  example,  in  [23],  the  direction  vector  g ^  is  defined  as 

„(*)  if  >  0  or  (VQ(z<fc)))i  <0  nqx 

\  0,  otherwise. 


Then  the  step  size  for  the  update  is  chosen  to  be 

a{k)  =  argrnin Q{z(k)  -  ag{k)),  (20) 

a. 


which  has  a  closed-form  solution 

„(*)  =  (gW)Vfc) 

(g(k))TBgW 


(21) 


The  computational  complexity  and  convergence  of  GPSR  is  difficult  to  es¬ 
timate  exactly  [23].  Another  issue  is  that  the  formulation  of  (15)  doubles  the 
dimension  of  the  equations  from  (13).  Therefore,  the  matrix  operations  involv¬ 
ing  B  must  take  into  account  its  special  structure  w.r.t.  A  and  AT . 

The  second  GP  algorithm,  which  we  will  benchmark  in  Section  3,  is  truncated 
Newton  interior-point  method  (TNIPM)  [31] .  It  transforms  the  same  objective 
function  (13)  to  a  quadratic  program  but  with  inequality  constraints: 


min  \\\Ax-b\\l  +  XYJi=iui 
subj.  to  —Ut  <  Xi  <  Ui,  i  =  1,  •  •  •  ,n. 


(22) 


Then  a  logarithmic  barrier  function  for  the  constraints  —  tq  <  Xi  <  rq  can  be 
constructed  as  follows  [24]: 

<F(a;,  u)  =  -^2  log  (m  +  Xi)  -  ^  log(uj  -  x,).  (23) 

i  i 

Over  the  domain  of  ( x ,  u),  the  central  path  consists  of  the  unique  minimizer 
(x*  (t) ,  u*  (t))  of  the  convex  function 


Ft(x,  u)  =  t(\\Ax  -  6|||  +  X^Ui)  +  $(x,u),  (24) 

i— 1 


where  the  parameter  t  G  [0,  oo). 

Using  the  primal  barrier  method  discussed  in  Section  1.2,  the  optimal  search 
direction  using  Newton’s  method  is  computed  by 


Ax 
A  u 


V2Ut  (*,«)• 


X7Ft(x,u)  G  R2n. 


(25) 
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Again,  for  large-scale  problems,  directly  solving  (25)  is  computationally  ex¬ 
pensive.  In  [31],  the  search  step  is  accelerated  by  a  preconditioned  conjugate 
gradients  (PCG)  algorithm,  where  an  efficient  preconditioner  is  proposed  to  ap¬ 
proximate  the  Hessian  of  |||Aa;  —  6j||.  The  reader  is  referred  to  [30,  40]  for  more 
details  about  PCG. 


2.2  Homotopy  Methods 

One  of  the  drawbacks  of  PDIPA  is  that  they  require  the  solution  sequence 
x(p)  to  be  close  to  a  “central  path”  as  p  — >•  0,  which  sometimes  is  difficult  to 
satisfy  and  computationally  expensive  in  practice.  In  this  section,  we  review  an 
approach  called  Homotopy  methods  [41,  34,  21]  that  can  mitigate  these  issues. 

We  recall  that  (Pi, 2)  can  be  written  as  an  unconstrained  convex  optimization 
problem: 

x*  =  argminx  F(x)  =  argmin^. \\\b  -  Ax\\%  +  A||®||i,  ^ 

=  arg  minx  f(x)  +  Xg(x)  ^ 

where  f(x)  =  ^\\b  —  A®|||,  g{x)  =  ||a;||i,  and  A  >  0  is  the  Lagrange  multiplier. 
On  one  hand,  w.r.t.  a  fixed  A,  the  optimal  solution  is  achieved  when  0  €  dF(x). 
On  the  other  hand,  similar  to  the  interior-point  algorithm,  if  we  define 


X  =  ix*\  '■  A  G  [0, 00)}, 


(27) 


X  identifies  a  solution  path  that  follows  the  change  in  A:  when  A  — »•  00,  x*x  =  0; 
when  A  — >  0,  x\  converges  to  the  solution  of  (Pi). 

The  Homotopy  methods  exploit  the  fact  that  the  objective  function  F(x) 
undergoes  a  homotopy  from  the  £2  constraint  to  the  £1  objective  in  (26)  as  A 
decreases.  One  can  further  show  that  the  solution  path  X  is  piece-wise  constant 
as  a  function  of  A  [41,  22,  21].  Therefore,  in  constructing  a  decreasing  sequence 
of  A,  it  is  only  necessary  to  identify  those  “breakpoints”  that  lead  to  changes 
of  the  support  set  of  x*x,  namely,  either  a  new  nonzero  coefficient  added  or  a 
previous  nonzero  coefficient  removed. 

The  algorithm  operates  in  an  iterative  fashion  with  an  initial  value  =  0. 
In  each  iteration,  given  a  nonzero  A,  we  solve  for  x  satisfying  dF(x )  =  0.  The 
first  summand  /  in  (26)  is  differentiable:  V/  =  AT(Ax  —  b)  =  —c{x).  The 
subgradient  of  g(x)  =  ||a;||i  is  given  by: 


u(x)  =  <9||ic||i 


jit  G  I"  : 


Ui  =  sgn(a;i),a;i  ^  0l 
■Ui  G  [-1, 1  ],Xi  =  0  J 


(28) 


Thus,  the  solution  to  dF(x)  =  0  is  also  the  solution  to  the  following  equation: 

c(x)  =  A1  b  —  AT  Ax  =  Xu(x).  (29) 


By  the  definition  (28),  the  sparse  support  set  at  each  iteration  is  given  by 


I={i:\cf)\  =  X}. 


(30) 
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The  algorithm  computes  the  update  for  x<k>  in  terms  of  the  positive/negative  di¬ 
rections  for  its  coefficients  and  the  magnitude.  Specifically,  the  update  direction 
on  the  sparse  support  d^k\l)  is  the  solution  to  the  following  system: 

A^AId^k\l)=sgn(c^\l)),  (31) 


where  Ax  is  a  submatrix  of  A  that  collects  the  column  vectors  of  A  w.r.t.  1, 
and  c^k\Z)  is  a  vector  that  contains  the  coefficients  of  c ^  w.r.t.  I.  For  the 
coefficients  whose  indices  are  not  in  I,  their  update  directions  are  manually  set 
to  zero.  Along  the  direction  indicated  by  d!k> ,  there  are  two  scenarios  when  an 
update  on  x  may  lead  to  a  breakpoint  where  the  condition  (29)  is  violated.  The 
first  scenario  occurs  when  an  element  of  c  not  in  the  support  set  would  increase 
in  magnitude  beyond  A: 


7+  =  min 


A  —  Cj  A  +  Q  1 

1  -  ajAxd^k\l)  ’  1  +  a?AxSk\X)  j  ' 


(32) 


The  index  that  achieves  y+  is  denoted  as  i+.  The  second  scenario  occurs  when 
an  element  of  c  in  the  support  set  X  crosses  zero,  violating  the  sign  agreement: 


7  =  mm.{-Xi/di}. 


(33) 


The  index  that  achieves  7“  is  denoted  as  i~ .  Hence,  the  homotopy  algorithm 
marches  to  the  next  breakpoint,  and  updates  the  sparse  support  set  by  either 
appending  I  with  i+  or  removing  i~ : 

x{k+i)  =  x(k)  +min{7+;7-}d«.  (34) 

The  algorithm  terminates  when  the  relative  change  in  x  between  consecutive 
iterations  is  sufficiently  small.  Algorithm  2  summarizes  an  implementation  of 
Homotopy  methods.6 

Overall,  solving  (31)  using  a  Cholesky  factorization  and  the  addition/removal 
of  the  sparse  support  elements  dominate  the  computation.  Since  one  can  keep 
track  of  the  rank-1  update  of  A^Az  in  solving  (31)  using  0(d2)  operations 
in  each  iteration,  the  computational  complexity  of  the  homotopy  algorithm  is 
0(kd2  +  kdn ). 

Finally,  we  want  to  point  out  that  Homotopy  has  been  shown  to  share  some 
connections  with  two  greedy  t\  -min  approximations,  namely,  least  angle  regres¬ 
sion  (LARS)  [22]  and  polytope  faces  pursuit  (PFP)  [42].  For  instance,  if  the 
ground-truth  signal  Xq  has  at  most  k  non-zero  components  with  k  «  n,  all 
three  algorithms  can  recover  it  in  k  iterations.  On  the  other  hand,  LARS  would 
never  remove  indices  from  the  sparse  support  set  during  the  iteration,  while 
Homotopy  and  PFP  include  mechanisms  to  remove  coefficients  from  the  sparse 
support.  More  importantly,  Homotopy  provably  solves  fA-min  (Pi),  while  LARS 
and  PFP  are  only  approximate  solutions.  A  more  detailed  comparison  between 
Homotopy,  LARS,  and  PFP  can  be  found  in  [21] . 

6  A  MATLAB  implementation  [1]  can  be  found  at  http://users.ece.gatech.edu/~sasif/ 
homotopy/. 
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Algorithm  2  Homotopy 

Input:  A  full  rank  matrix  A  =  [iq,---  ivn\  €  Rdxra, 

d  <  n,  a  vector  b  £  Rd,  initial  Lagrangian  parameter  A  = 

2||Ar&||00. 

1:  Initialization:  k  ■£-  0.  Find  the  first  support  index:  i  =  arg max"=1  ||ujfo||, 
!  =  {*}• 

2:  repeat 
3:  k  4 —  k  A  1. 

4:  Solve  for  the  update  direction  d ^  in  (31). 

5:  Compute  the  sparse  support  updates  (32)  and  (33):  7*  <—  min{7+,7-}. 

6:  Update  x(k\  X,  and  A  A  —  7*. 

7:  until  stopping  criterion  is  satisfied. 

Output:  x*  <-  £c(fc). 


2.3  Iterative  Shrinkage-Thresholding  Methods 

Although  Homotopy  employs  a  more  efficient  iterative  update  rule  that  only  in¬ 
volves  operations  on  those  submatrices  of  A  corresponding  to  the  support  set  of 
the  current  vector  x,  it  may  not  be  as  efficient  when  the  sparsity  k  and  the  ob¬ 
servation  dimension  d  grow  proportionally  with  the  signal  dimension  n.  In  such 
scenarios,  one  can  show  that  the  worst-case  computational  complexity  is  still 
bounded  by  0(n3).  In  this  section,  we  discuss  Iterative  Shrinkage-Thresholding 
(1ST)  methods  [16,  14,  26,  50],  whose  implementation  mainly  involves  simple 
operations  such  as  vector  algebra  and  matrix- vector  multiplications.  This  is 
in  contrast  to  most  past  methods  that  all  involve  expensive  operations  such  as 
matrix  factorization  and  solving  linear  least  squares  problems.  A  short  survey 
on  the  applications  of  1ST  can  be  found  in  [53] . 

In  a  nutshell,  1ST  considers  solving  (Pi, 2)  as  a  special  case  of  the  following 
composite  objective  function: 

minP(x)  =  f(x)  +  Xg(x),  (35) 

X 

where  /  :  — >  R  is  a  smooth  and  convex  function,  and  g  :  K"  — >  R  as 

the  regularization  term  is  bounded  from  below  but  not  necessarily  smooth  nor 
convex.  For  U-iriin  in  particular,  g  is  also  separable,  that  is, 

n 

9(x)  =^2gi(xi).  (36) 

i= 1 

Clearly,  let  f(x)  =  |||6  —  Ax\\^  and  g(x)  =  ||a:||i,  and  the  objective  function 
(35)  becomes  the  unconstrained  BPDN  problem. 

The  update  rule  to  minimize  (35)  is  computed  using  a  second-order  approx- 
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imation  of  /  [50,  5]: 

a;(fc+1)  =  argmin  +  — 

+\\\x  -  JK(fe)  HI  •  V2f{x^)  +  A g{x)} 
ss  argminx{(:E  —  x^^V  f(x^) 

+^\\x  -  x^\\l  +  Ag(tc)} 

=  argminx{i||a:-w(fc)||l  +  ^g{x)}, 

=  Ga(*)(*W), 


where 

«(fc)  =  *(*>  -  Jry  V/^).  (38) 

In  (37),  the  Hessian  V2 / (x^)  is  approximated  by  a  diagonal  matrix  a^I. 

If  we  replace  g{x)  in  (37)  by  the  fi-norm  ||  tc||  i ,  which  is  a  separable  function, 
then  Ga(k){x^)  has  a  closed-form  solution  w.r.t.  each  component: 


where 


r(fc+!) 


arg  minX( 
soft 


f  (Xi  —  IL^)2 

5  ck(^)  J  5 


soft  (it,  a) 


sgn(u)  max{|ti|  —  a,  0} 
f sgn(u)(|w|  -  a)  if  |u|  >  a 
\  0  otherwise 


(39) 


(40) 


is  the  soft-thresholding  or  shrinkage  function  [18]. 

There  are  two  free  parameters  in  (37),  namely,  the  regularizing  coefficient 
A  and  the  coefficient  that  approximates  the  hessian  matrix  V2/.  Different 
strategies  for  choosing  these  parameters  have  been  proposed.  Since  al  mimics 
the  Hessian  V2/,  we  require  that  a^k\x^  —  aA^’-1))  ~  V f(x^)  —  V/(a;(fe-1)) 
in  the  least-squares  sense.  Hence, 


a 


(fc+i) 


argmina  ||a(aAfc)  —  x^k  ^) 
-  (V/(aAfc))-  V/(£c(fe-1))) 

(x<fe> -x<'=-1))T(V/(a:<fc>)-V/(x 


2 

2 

(fc-i) 


) 


(x(fc)_x(fc  —  1))T(x(fc)_ x(fc  — x)) 


(41) 


This  is  known  as  the  Barzilai-Borwein  equation  [3,  43,  50]. 

For  choosing  A,  instead  of  using  a  fixed  value,  several  papers  have  proposed  a 
continuation  strategy  [26,  23],  in  which  (37)  is  solved  for  a  decreasing  sequence  of 
A.  As  mentioned  in  Section  2.2,  (37)  recovers  the  optimal  t^-min  solution  when 
A  — >  0.  However,  it  has  been  observed  that  the  practical  performance  degrades 
by  directly  solving  (37)  for  small  values  of  A,  which  has  been  dubbed  as  a  “cold” 
starting  point.  Instead,  continuation  employs  a  warm-starting  strategy  by  first 
solving  (37)  for  a  larger  value  of  A,  then  decreasing  A  in  steps  towards  its  desired 
value. 
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The  1ST  algorithm  is  summarized  in  Algorithm  3.'  The  interested  reader  is 
referred  to  [33]  for  guidelines  on  choosing  good  soft-thresholding  parameters  to 
improve  accuracy. 


Algorithm  3  Iterative  Shrinkage-Thresholding  (1ST) 

Input:  A  full  rank  matrix  A  £  ]Rdxra,  d  <  n,  a  vector 

b  £  ]Rd,  Lagrangian  Ao,  initial  values  for  x and  a0,  k  <— 

0. 

1:  Generate  a  reducing  sequence  Ao>Ai>--->Aat. 

2:  for  i  =  0, 1,  •  •  •  ,  N  do 
3:  A  i —  A?; 

4:  repeat 

5:  k  i —  k  A  1. 

6:  £-  G{x('k~1'>). 

7:  Update  a ^  using  (41). 

8:  until  The  objective  function  F(x^)  decreases. 

9:  end  for 
Output:  x*  £-  x^k\ 


2.4  Proximal  Gradient  Methods 

Proximal  Gradient  (PG)  algorithms  represent  another  class  of  algorithms  that 
solve  convex  optimization  problems  of  the  form  (35).  Assume  that  our  cost 
function  F(-)  can  be  decomposed  as  the  sum  of  two  functions  /  and  g,  where  / 
is  a  smooth  convex  function  with  Lipschitz  continuous  gradient,  and  g  is  a  con¬ 
tinuous  convex  function.  The  principle  behind  these  algorithms  is  to  iteratively 
form  quadratic  approximations  Q(x,  y)  to  F{x)  around  a  carefully  chosen  point 
y ,  and  to  minimize  Q(x ,  y)  rather  than  the  original  cost  function  F . 

For  our  problem,  we  define  gix)  =  j|a:||i  and  f(x)  =  |||Aa:  —  b |||.  We 
note  that  V  f(x)  =  AT{Ax  —  b)  is  Lipschitz  continuous  with  Lipschitz  constant 
Lf  =  ||A||2.  Define  Q(x,y)  as: 

Q(x,y)  =  f(y)  +  (V/(y),  x  —  y)  +  ^  ||a;  -  y||2  +  A g(x).  (42) 

It  can  be  shown  that  F(x )  <  Q(x,y)  for  all  y ,  and 

argmin  Q(x,y)  =  argmin  j  A  gix)  +  \\x  —  tt||2l  ,  (43) 

X  X  I  2  J 

where  u  =  y—  Wy/(y)  same  trick  used  in  (37).  For  the  L|-min  problem, 

(43)  has  a  closed-form  solution  given  by  the  soft-thresholding  function: 

argmin  Q(x,y)  =  soft  (  u,  )  .  (44) 

x  V  Lf  J 

7  A  MATLAB  implementation  called  Sparse  Reconstruction  by  Separable  Approximation 
(SpaRSA)  [50]  is  available  at  http://www.lx.it.pt/~mtf/SpaRSA/. 
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However,  unlike  the  iterative  thresholding  algorithm  described  earlier,  we  use 
a  smoothed  computation  of  the  sequence  yk.  It  has  been  shown  that  choosing 

y(fc)  =  *(*>  +  (*<*>  -  ®(fe”1))  ,  (45) 

where  {ffc}  is  a  positive  real  sequence  satisfying  tf.  —  tk  <  t\_  i,  achieves  an 
accelerated  non-asymptotic  convergence  rate  of  0{k~2)  [38,  39,  5].  To  further 
accelerate  the  convergence  of  the  algorithm,  one  can  also  make  use  of  the  con¬ 
tinuation  technique  as  in  Section  2.3. 

Finally,  for  large  problems,  it  is  often  computationally  expensive  to  directly 
compute  Lf  =  ||A||2.8  A  backtracking  line-search  strategy  [5]  can  be  used  to 
generate  a  scalar  sequence  {£/,-}  that  approximates  Lf.  We  define 

Ql{x,v)  =  f(y)  +  (x  —  y)TV/(y)  +  ^\\x  —  y\\2  +  Xg(x).  (46) 

Suppose  that  77  >  1  is  a  pre-defined  constant.  Then,  given  y ^  at  the  fcth 
iteration,  we  set  Lk  =  y3  Lk_  1,  where  j  is  the  smallest  nonnegative  integer  such 
that  the  following  inequality  holds: 

F(GLk(y (fe)))  <  QLi(GLk{yW),yW),  (47) 

where  GL(y)  =  argmin^.  QL{x,y)  =  soft  (u,  £)  for  u  =  y  -  fV/(y). 

The  algorithm,  dubbed  FISTA  in  [5],  is  summarized  as  Algorithm  4. 9  The 
convergence  behavior  of  FISTA  is  given  by 


F{x{k))  -  F{x*)  < 


2£/||a;(0)  -  a;*||2 

{k  +  l)2 


V  k. 


(48) 


The  interested  reader  may  refer  to  [39,  5,  6]  for  a  proof  of  the  above  result. 


2.5  Augmented  Lagrange  Multiplier  Methods 

Lagrange  multiplier  methods  are  a  popular  class  of  algorithms  in  convex  pro¬ 
gramming.  The  basic  idea  is  to  eliminate  equality  constraints  and  instead  add 
a  penalty  term  to  the  cost  function  that  assigns  a  very  high  cost  to  infeasi¬ 
ble  points.  Augmented  Lagrange  Multiplier  (ALM)  methods  differ  from  other 
penalty-based  approaches  by  simultaneously  estimating  the  optimal  solution 
and  Lagrange  multipliers  in  an  iterative  fashion. 

We  consider  the  general  £-\ -min  problem  (1)  with  the  optimal  solution  x* . 
The  corresponding  augmented  Lagrangian  function  is  given  by 

Lu,(x,y)  =  ||£c||  1  +  (y,  b  —  Ax)  +  ^\\b-Ax\\l,  (49) 

8  This  problem  occurs  in  the  1ST  algorithm  as  well. 

9  An  implementation  of  FISTA  can  be  downloaded  from  the  website  of  this  paper.  Another 
Matlab  toolbox  called  NESTA  [6]  is  available  at:  http://www.acm.caltech.edu/~nesta/. 
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Algorithm  4  Fast  Iterative  Shrinkage-Threshold  Algorithm  (FISTA) 


Input:  b  G  Rm,  A  G  Rmx”. 
l:  Set  t—  0,  «—  0,  to  t—  1,  t±  t—  1,  k  t—  1. 

2:  Initialize  L0,  Ai,  /3  G  (0, 1),  A  >  0. 

3:  while  not  converged  do 
4:  (a;(fe)  —  a;^-1)). 

5:  Update  Lk  using  (47)  with  . 

6:  u ^  y ^  AT(Ay (fe)  —  6). 

7:  a:(fc+1) -f- soft 


8:  tk+ 1  •<— 


1  +  ^4^  +  ! 
2 


9:  Afc+i  t- max(/3Afc,  A). 

10:  fct-fc  +  1. 

11:  end  while 

Output:  x*  i -  x^k\ 


where  y  >  0  is  a  constant  that  determines  the  penalty  for  infeasibility,  and 
y  is  a  vector  of  Lagrange  multipliers.  Let  y*  be  a  Lagrange  multiplier  vector 
satisfying  the  second-order  sufficiency  conditions  for  optimality  (see  section  3.2 
in  [8]  for  more  details).  Then,  for  sufficiently  large  y,  it  can  be  shown  that 

x*  =  argmin  LJx,y*).  (50) 

X 

The  main  problem  with  this  solution  is  that  y*  is  unknown  in  general.  Fur¬ 
thermore,  the  choice  of  y  is  not  straightforward  from  the  above  formulation.  It 
is  clear  that  to  compute  x*  by  minimizing  L/Ji(x1y),  we  must  choose  y  close 
to  y*  and  set  y  to  be  a  very  large  positive  constant.  The  following  iterative 
procedure  has  been  proposed  in  [8]  to  simultaneously  compute  y*  and  x*: 

f  xk+1  =  argminx  L^k(x,yk)  ^ 

\  yk+ 1  =  yk  +  yk{b-  Axk+1)  ’ 

where  {yk}  is  a  monotonically  increasing  positive  sequence.  We  note  that  the 
first  step  in  the  above  procedure  is  itself  an  unconstrained  convex  optimization 
problem.  Thus,  the  above  iterative  procedure  is  computationally  efficient  only  if 
it  is  easier  to  minimize  the  augmented  Lagrangian  function  compared  to  solving 
the  the  original  constrained  optimization  problem  (1)  directly. 

We  focus  our  attention  on  solving  the  first  step  of  (51)  for  the  fi-min  prob¬ 
lem.  Although  it  is  not  possible  to  obtain  a  closed-form  solution,  we  note 
that  the  cost  function  has  the  same  form  as  described  in  (35).  Furthermore, 
the  quadratic  penalty  term  is  smooth  and  has  a  Lipschitz  continuous  gradi¬ 
ent.  Therefore,  we  can  solve  it  efficiently  using  proximal  gradient  methods  (e.g., 
FISTA)  described  in  the  Section  2.4. 

The  entire  algorithm  is  summarized  as  Algorithm  5,  where  r  represents  the 
largest  eigenvalue  of  the  matrix  ATA,  and  p  >  1  is  a  constant.  Although  the 
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steps  in  Algorithm  5  resemble  those  of  Algorithm  4  very  closely,  we  will  later 
see  that  ALM  in  general  is  more  accurate  and  converges  faster. 


Algorithm  5  Augmented  Lagrange  Multiplier  (ALM) 
Input:  b  €  Rm,  A  G  Rmx”. 

1:  while  not  converged  (fc  =  1,2,...)  do 
2:  t\  4—  1,  Z\  4—  £Cfc,  Ui  4—  X fc 

3:  while  not  converged  (l  =  1,2,...)  do 

4:  ul+1  <-  soft  (*,  -  (As,  -  b 

5:  tj+l  t—  |  ^1  +  ^/T+4tf) 

6:  Z;+i  <-  U(+i  +  JUyK+l  -  Ut) 

7:  end  while 

8:  Xk+l  4—  Ui+ 1 

9:  2/fe+i  t-  yk  +  Hk(b  —  Aa?fc+i) 

10:  Hk+ 1  -f-  p  •  /Life 

11:  end  while 
Output:  a;*  ■<—  a:^. 


We  would  like  to  point  out  that  a  similar  algorithm  called  Alternating  Di¬ 
rections  Method  (ADM)  [52]  essentially  shares  the  same  idea  as  above.  10  The 
major  difference  is  that  ADM  would  approximate  the  solution  to  the  first  step 
of  the  ALM  iteration  in  (51).  This  is  done  by  computing  only  one  iteration  of 
the  FISTA  algorithm.  Although  this  approximation  yields  a  gain  in  computa¬ 
tional  speed,  we  have  observed  from  experiments  that  the  convergence  behavior 
of  ADM  may  vary  depending  on  the  distribution  of  the  matrix  A. 

Algorithm  5  shows  the  ALM  method  applied  to  the  primal  problem  (1), 
which  is  referred  to  as  Primal  ALM  (PALM)  in  this  paper.  Interestingly,  the 
principles  of  ALM  can  also  be  applied  to  its  dual  problem  [52]: 

max bT y  subj.  to  ATy  e  (52) 

y 

where  B^°  =  {x  £  Rn  :  ||a;||oo  <  1}.  Under  certain  circumstances,  this  may 
result  in  a  more  computationally  efficient  algorithm.  In  the  rest  of  the  section, 
we  will  briefly  explain  the  dual  ALM  (DALM)  algorithm. 

Simple  computation  shows  that  DALM  solves  the  following  problem: 

mhij,^*  -bT y  -  xT(z  -  ATy)  +  § ||z  -  ATy\\\  ,  , 

subj.  to  2  €  Bf.  K  ’ 

Here,  x  as  the  primal  variable  becomes  the  associated  Lagrange  multiplier  in  the 
dual  space  [52].  Since  it  is  difficult  to  solve  the  above  problem  simultaneously 
w.r.t.  y,  x  and  z,  we  adopt  a  strategy  by  alternately  updating  the  primal 
variable  x  and  the  dual  variables  y  and  2. 


10  A  MATLAB  toolbox  of  the  ADM  algorithm  called  YALL1  is  provided  at:  http : //yalll . 
blogs .rice . edu/. 
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On  the  one  hand,  for  x  =  xk  and  y  =  yk,  the  minimizer  zk+ 1  with  respect 
to  z  is  given  by 

Zk+i  =  VB “  ( ATyk  +  xk/P),  (54) 

where  Vb°°  represents  the  projection  operator  onto  B^°.  On  the  other  hand, 
given  x  =  xk  and  z  =  zk+i,  the  minimization  with  respect  to  y  is  a  least  squares 
problem,  whose  solution  is  given  by  the  solution  to  the  following  equation: 

/3AATy  =  f3Azk+i  —  (Axk  —  b).  (55) 

Suppose  that  AAT  is  invertible.  Then,  we  directly  use  its  inverse  to  solve  (55). 
However,  for  large  scale  problems,  this  matrix  inversion  can  be  computationally 
expensive.  Therefore,  in  such  cases,  we  can  approximate  the  solution  with  one 
step  of  the  Conjugate  Gradient  algorithm  in  the  y  direction  at  each  iteration, 
as  proposed  in  [52]. 

For  our  face  recognition  application,  the  optimization  problem  of  interest  is 
of  the  form  (3).  For  this  problem,  the  basic  iteration  of  the  DALM  algorithm  is 
given  by 

(  zk+1  =  VB^{BTyk  +  wk/P), 

<  yk+ 1  =  (BBT)~1(Bzk+ 1  -  (Bwk  -  b)//3),  (56) 

[  =  wk-  /3(zk+ 1  -  BTyk+1), 

where  B  =  [A,  I],  and  wk  =  [xk\ek\.  As  we  will  see  in  Section  4,  the  matrix 
AAT  is  often  easily  invertible,  and  so  we  solve  (55)  exactly. 

Since  all  the  subproblems  now  are  solved  exactly,  the  convergence  of  the  dual 
algorithm  is  guaranteed.  Furthermore,  since  its  major  computation  lies  in  solv¬ 
ing  for  the  dual  variable  y,  when  the  number  of  dual  variables  are  much  smaller 
than  the  number  of  primal  variables  (i.e. ,  when  d,  <C  n),  the  dual  algorithm 
could  be  more  efficient  than  the  primal  algorithm.11 

3  Simulation:  Random  Sparse  Signals 

In  this  section,  we  present  two  sets  of  experiments  to  benchmark  the  perfor¬ 
mance  of  the  five  fast  G-min  algorithms  on  random  sparse  signals,  namely, 
L1LS/TNIPM,  Homotopy,  SpaRSA/IST,  FISTA,  and  DALM,  together  with 
the  basic  BP  algorithm  (i.e.,  PDIPA).  All  experiments  are  performed  in  MAT- 
LAB  on  a  Dell  PowerEdge  1900  workstation  with  dual  quad-core  2.66GHz  Xeon 
processors  and  8GB  of  memory. 

Before  we  present  the  results,  we  need  to  caution  that  the  performance  of 
these  algorithms  is  affected  by  multiple  factors,  including  the  algorithm  imple¬ 
mentation,  the  chosen  ad-hoc  parameters,  and  even  the  MATLAB  environment 
itself.  One  factor  that  we  should  pay  special  attention  to  is  the  stopping  criteria 
used  in  benchmarking  these  algorithms.  As  we  first  mentioned  in  Section  1.2, 
choosing  a  good  stopping  criterion  is  important  to  properly  exit  an  iteration 
when  the  estimate  becomes  close  to  a  local  or  global  optimum.  On  one  hand, 

11  The  PALM  and  DALM  algorithms  in  MATLAB  can  be  downloaded  from  the  website  of 
this  paper. 
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in  general,  straightforward  rules  do  exist,  such  as  the  relative  change  of  the 
objective  function: 


||F(x(fc+1))  -F(tc(fe))|| 
\\F(xW)\\ 


(57) 


or  the  relative  change  of  the  estimate: 


H^fc+i)  _  aj(fc)|| 

ll*(fc)ll 


(58) 


However,  their  efficacy  depends  on  a  proper  step  size  of  the  update  rule:  If  the 
step  size  is  poorly  chosen,  the  algorithm  may  terminate  prematurely  when  the 
estimate  is  still  far  away  from  the  optimum.  On  the  other  hand,  certain  special 
criteria  are  more  effective  to  some  algorithms  than  the  others.  For  example,  for 
PDIPA,  it  is  natural  to  use  the  (relative)  duality  gap  between  the  primal  and 
dual  solutions;  for  Homotopy,  it  is  easy  to  measure  the  relative  change  of  the 
sparse  support  as  the  stopping  criterion. 

In  this  section,  since  the  experiment  is  performed  on  synthetic  data,  we  use 
a  simple  threshold  that  compares  the  f^-norm  difference  between  the  £-\  -min 
estimate  x*  and  the  ground  truth  Xq  as  the  stopping  criterion. 


3.1  p-S  Plot  in  the  Noise-Free  Case 

The  first  experiment  is  designed  to  measure  the  accuracy  of  the  various  algo¬ 
rithms  in  recovering  sparse  signals  in  the  noise- free  case  (Pi).  We  evaluate  each 
algorithm  using  a  p-S  plot,  where  the  sparsity  rate  p  =  k/n  G  (0,1]  and  the 
sampling  rate  5  =  d/n  G  (0, 1].  At  each  S,  the  percentages  of  successes  that  an 
fi-min  algorithm  finds  the  ground-truth  solution  Xq  (with  a  very  small  toler¬ 
ance  threshold)  are  measured  over  different  p’s.  Then  a  fixed  success  rate,  say 
of  95%,  over  all  As  can  be  interpolated  as  a  curve  in  the  p-S  plot.  In  general, 
the  higher  the  success  rates,  the  better  an  algorithm  can  recover  dense  signals 
in  the  (Pi)  problem. 

Figure  2  shows  the  95%  success-rate  curves  for  the  six  algorithms.  In  the 
simulation,  the  ambient  dimension  d  —  1000  is  fixed.  For  a  fixed  pair  (p,  <5) ,  the 
support  of  the  ground-truth  signal  Xq  is  chosen  uniformly  at  random,  and  the 
values  of  the  non-zero  entries  are  drawn  from  the  standard  normal  distribution. 
In  addition,  the  vector  is  normalized  to  have  unit  t^-norm.  The  measurement 
matrix  A  is  generated  as  a  random  Gaussian  matrix,  each  of  whose  entries  is  i.i.d. 
randomly  generated  from  the  standard  normal  distribution  and  then  normalized 
to  have  unit  column  norm.  We  choose  to  compute  the  average  success  rates  over 
100  trials  on  the  vertices  of  a  grid  of  (p,  S)  pairs  for  each  of  the  fj-iriin  algorithms, 
and  the  coordinates  of  the  95%  rate  are  interpolated  from  the  vertex  values. 

The  observations  of  the  experiment  are  summarized  below: 

1.  Without  concerns  about  speed  and  data  noise,  the  success  rates  of  the 
interior-point  method  PDIPA  is  the  highest  of  all  the  algorithms  compared 
in  Figure  2,  especially  when  the  signal  becomes  dense. 
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Fig.  2:  The  p-S  plot  (in  color)  shows  the  95%  success-rate  curves  for  the  six  fast 
f^-min  algorithms. 


2.  The  accuracy  for  the  remaining  five  algorithm  is  separated  in  three  groups. 
In  particular,  the  success  rates  of  L1LS  and  Homotopy  are  similar  w.r.t. 
different  sparsity  rates  and  sampling  rates,  and  they  are  also  very  close 
to  PDIPA.  In  the  low  sampling-rate  regime,  Homotopy  is  slightly  better 
than  L1LS. 

3.  The  success  rates  of  FISTA  and  DALM  are  comparable  over  all  sampling 
rates.  In  the  noise-free  case,  they  are  not  as  accurate  as  the  other  exact 
t'i  -min  solutions.  However,  their  performance  shows  a  significant  improve¬ 
ment  over  the  1ST  algorithm,  namely,  SpaRSA. 

3.2  Performance  with  Moderate  Data  Noise 

We  are  more  interested  in  comparing  the  f?i-min  algorithms  when  the  measure¬ 
ment  contains  moderate  amounts  of  data  noise.  In  the  second  simulation,  we 
rank  the  six  algorithms  under  two  scenarios:  Firstly,  we  measure  the  perfor¬ 
mance  in  the  low-sparsity  regime,  where  the  ambient  dimension  n  =  2000  and 
the  sparsity  ratio  p  =  k/n  =  0.1  are  fixed,  and  the  dimension  d  of  the  Gaussian 
random  projection  varies  from  800  to  1900.  Secondly,  we  measure  the  perfor¬ 
mance  when  x  becomes  dense  w.r.t.  a  fixed  sampling  rate,  where  n  =  2000  and 
d  =  1500  are  fixed,  and  the  sparsity  ratio  p  =  k/n  varies  from  0.1  to  0.26.  The 
maximum  number  of  iterations  for  all  algorithms  is  limited  to  5000.  The  results 
are  shown  in  Figure  3  and  4,  respectively.  In  both  experiments,  we  corrupt  the 
measurement  vector  b  with  e,  an  additive  white  noise  term  whose  entries  are 
i.i.d.  according  to  a  Gaussian  distribution  with  zero  mean  and  variance  0.01. 

Firstly,  when  a  low  sparsity  ratio  of  p  =  0.1  is  fixed  in  Figure  3,  £-\ -min 


3  Simulation:  Random  Sparse  Signals 


20 


0.2tf  fr- 


CC 

O  0.2 


-  •  SpaRSA 

-  e  -  FISTA 

-  $  - DALM 


-  -fr-  ~ 


800  1000  1200  1400  1600  1800 

Projection  Dimension 

(b)  Average  run  time  (detail) 


Fig.  3:  Comparison  of  the  six  fast  C|-min  algorithms  w.r.t.  a  fixed  sparsity  ratio 
(n  =  2000,  k  =  200),  and  varying  projection  dimension  d  from  300  to 
1900. 
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Fig.  4:  Comparison  of  the  six  fast  t'l-min  algorithms  w.r.t.  a  fixed  sampling 
ratio  (n  =  2000,  d  =  1500),  and  varying  sparsity  ratio  k/n  from  0.1  to 
0.5. 
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becomes  better  conditioned  as  the  projection  dimension  increase.  However, 
the  computation  cost  also  increases  with  the  projection  dimension.  We  then 
compare  the  performance  of  the  six  algorithm  in  Figure  3: 

1.  The  computational  complexity  of  PDIPA  grows  much  faster  than  the  other 
fast  algorithms.  More  importantly,  in  contrast  to  its  performance  in  the 
noise-free  case,  the  estimation  error  also  grows  exponentially,  in  which  case 
the  algorithm  (i.e. ,  the  update  rule  (11))  fails  to  converge  to  an  estimate 
that  is  close  to  the  ground  truth. 

2.  As  the  projection  dimension  increases,  the  five  fast  fi-inin  algorithms  all 
converge  to  good  approximate  solutions.  The  estimation  error  of  Homo- 
topy  is  slightly  higher  than  the  rest  four  algorithms. 

3.  In  terms  of  speed,  L1LS  and  Homotopy  take  much  longer  time  to  converge 
than  SpaRSA,  FISTA,  and  DALM. 

4.  The  average  run  time  of  DALM  is  the  shortest  over  all  projection  dimen¬ 
sions,  which  makes  it  the  best  algorithm  in  this  comparison. 

Secondly,  when  the  projection  dimension  d  =  1500  is  fixed  in  Figure  4,  we 
compare  both  the  average  run  time  and  the  average  estimation  error  when  the 
sparsity  varies: 

1.  PDIPA  significantly  underperforms  the  rest  five  fast  algorithms  in  terms 
of  both  accuracy  and  speed,  consistent  with  the  result  in  the  previous 
simulation  with  noisy  data. 

2.  The  average  run  time  of  Homotopy  grows  almost  linearly  with  the  sparsity 
ratio,  while  the  other  algorithms  are  relatively  unaffected.  Thus,  Homo¬ 
topy  is  more  suitable  for  scenarios  where  the  unknown  signal  is  expected 
to  have  a  very  small  sparsity  ratio. 

3.  The  computational  cost  of  the  rest  four  algorithms,  namely,  L1LS,  SpaRSA, 
FISTA,  and  DALM,  remains  largely  unchanged  when  the  random  projec¬ 
tion  dimension  d  is  fixed. 

4.  DALM  is  the  fastest  algorithm  in  the  low-sparsity  regime,  but  its  run 
time  approaches  that  of  the  other  first-order  methods  in  the  high-sparsity 
regime.  Overall,  DALM  is  the  best  algorithm  in  this  scenario. 

To  summarize,  in  the  presence  of  low  levels  of  Gaussian  noise,  PDIPA  per¬ 
forms  quite  badly  in  comparison  with  the  other  fj-min  algorithms  that  employ 
more  sophisticated  techniques  to  handle  noisy  data.  Perhaps  a  more  surpris¬ 
ing  observation  is  that  when  the  coefficients  of  A  are  randomly  drawn  from  a 
Gaussian  distribution,  under  a  broad  range  of  simulation  conditions,  FISTA  is 
not  significantly  better  than  the  original  1ST  algorithm.12  However,  in  the  next 

12  We  have  observed  that  the  performance  of  FISTA  may  vary  widely  depending  on  its  chosen 
parameters.  If  the  parameters  are  tuned  at  different  noise  level,  its  speed  of  convergence  can 
be  much  improved  compared  to  1ST.  However,  for  consistency,  all  algorithms  in  this  simulation 
are  assigned  a  fixed  set  of  parameters. 
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section,  we  will  see  that  when  the  dictionary  A  is  replaced  by  real-world  data 
matrices  that  contain  training  images  in  face  recognition,  the  performance  of 
the  six  £-\  -min  algorithms  would  be  dramatically  different  from  synthetic  data. 


4  Face  Recognition  Experiment  I:  Corruption  and  Disguise 
Compensation 


In  this  section,  we  benchmark  the  performance  of  the  six  algorithms  in  robust 
face  recognition.  The  experiment  is  set  up  to  estimate  sparse  representation 
of  real  face  images  based  on  the  cross- and-bouquet  (CAB)  model  introduced  in 

[47]. 

More  specifically,  It  has  been  known  in  face  recognition  that  a  well-aligned 
frontal  face  image  b  under  different  lighting  and  expression  lies  close  to  a  special 
low-dimensional  linear  subspace  spanned  by  the  training  samples  from  the  same 
subject,  called  a  face  subspace  [7,  4]: 


A  =  [Ui,l,Vi,2,  ' 


pdxrii 


(59) 


where  vl:1  represents  the  j-th  training  image  from  the  <-tli  subject  stacked  in 
the  vector  form.  Given  C  subjects  and  a  new  test  image  b  (also  in  the  vector 
form),  we  seek  the  sparsest  linear  representation  of  the  sample  with  respect  to 
all  training  examples: 


b  =  [A1,A2,---  ,Ac)[x i; £c2;  -  •  •  ;®c]  =  Ax,  (60) 

where  A  £  R.dxn  collects  all  the  training  images. 

Clearly,  if  b  is  a  valid  test  image,  it  must  be  associated  with  one  of  the 
C  subjects.  Therefore,  the  corresponding  representation  in  (60)  has  a  sparse 
representation  *  =  [•■•  ;  0;  X£,  0;  •  •  •  ]:  on  average  only  a  fraction  of  ^  coefficients 
are  nonzero,  and  the  dominant  nonzero  coefficients  in  sparse  representation  x 
reveal  the  true  subject  class. 

In  addition,  we  consider  the  situation  where  the  query  image  b  may  be 
severely  occluded  or  corrupted.  The  problem  is  modeled  by  a  corrupted  set  of 
linear  equations  b  =  Ax  +  e,  where  e  £  is  an  unknown  vector  whose  nonzero 
entries  correspond  to  the  corrupted  pixels.  In  [49],  the  authors  proposed  to 
estimate  w  =  [a;;  e]  jointly  as  the  sparsest  solution  to  the  extended  equation: 

min  ||tu||i  subject  to  b  =  [A,  I]w.  (61) 

The  new  dictionary  [A,  I]  was  dubbed  a  cross-and-bouquet  model  in  the  follow¬ 
ing  sense.  The  columns  of  A  are  highly  correlated,  as  the  convex  hull  spanned  by 
all  face  images  of  all  subjects  occupies  an  extremely  tiny  portion  of  the  ambient 
space.  These  vectors  are  tightly  bundled  together  as  a  “bouquet,”  whereas  the 
vectors  associated  with  the  identity  matrix  and  its  negative  ±7  form  a  standard 
“cross”  in  Rd,  as  shown  in  Figure  5. 

In  this  section,  the  performance  of  the  six  G-min  algorithms  using  the  CAB 
model  is  benchmarked  on  the  CMU  Multi-PIE  face  database  [25].  A  subset 
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\\  /  Highly  coherent 

(  volume  <  1.5  x  10“229  ) 

Fig.  5:  The  cross-and-bouquet  model  for  face  recognition.  The  raw  images  of  hu¬ 
man  faces  expressed  as  columns  of  A  are  clustered  with  very  small  variance. 
(Courtesy  of  John  Wright  [47]) 

of  249  subjects  from  the  data  set  (Session  1)  is  used  for  this  experiment.  Each 
subject  is  captured  under  20  different  illumination  conditions  with  a  frontal  pose. 
The  images  are  then  manually  aligned  and  cropped,  and  down-sampled  to  40  x  30 
pixels.  Out  of  the  20  images  for  each  subject,  images  {0,1,7,13,14,16,18} 
with  extreme  illumination  conditions  are  chosen  as  the  training  images,  and  the 
remaining  13  images  are  used  as  test  images.  Finally,  a  certain  number  of  image 
pixels  are  randomly  corrupted  by  a  uniform  distribution  between  [0,255],  with 
the  corruption  percentage  from  0%  to  90%,  as  four  examples  shown  in  Figure 
6.  In  Table  1,  the  recognition  rates  between  50%  and  70%  pixel  corruption  are 
highlighted  for  more  detailed  comparison. 


Fig.  6:  An  aligned  face  image  of  Subject  1  in  Multi-PIE,  Session  1,  under  the  ambient 
lighting  condition  (No.  0)  is  shown  on  the  left.  On  the  right,  20%,  40%,  60%, 
and  80%  of  image  pixels  are  randomly  selected  and  corrupted  with  uniformly 
distributed  values  in  [0,  255],  respectively. 

We  measure  the  performance  of  the  algorithms  in  terms  of  the  final  recog¬ 
nition  rate  and  speed  of  execution.  In  choosing  a  proper  stopping  criterion, 
the  stopping  threshold  is  individually  tuned  for  each  algorithm  to  achieve  the 
highest  recognition  rate.  The  results  are  shown  in  Figure  8  and  9.  In  addition, 
Figure  7  shows  the  reconstructed  face  images  after  corruption  compensation, 
namely,  b  =  b  —  e,  given  by  the  Homotopy  solver. 

In  terms  of  accuracy,  Homotopy  achieves  the  best  overall  performance.  For 
instance,  with  60%  of  the  pixels  randomly  corrupted,  its  average  recognition 
rate  based  on  the  CAB  model  is  about  99%.  The  performance  of  PDIPA  is  very 
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Fig.  7:  Recovered  face  images  ( b  =  b—e)  from  Figure  6  using  the  CAB  model  (61)  and 
Homotopy.  Left:  Reconstructed  image  from  20%  pixels  corruption.  Middle 
Left:  Reconstructed  image  from  40%  pixels  corruption.  Middle  Right:  Re¬ 
constructed  image  from  60%  pixels  corruption.  Right:  Reconstructed  image 
from  80%  pixels  corruption. 


Fig.  8:  Average  recognition  accuracy  (in  percentage)  on  the  Multi-PIE  database. 


Tab.  1:  Average  recognition  rates  (in  percentage)  between  50%  and  70%  random  pixel 
corruption  on  the  multi-PIE  database.  The  highest  rates  are  in  boldface. 


Corr. 

PDIPA 

L1LS 

Homotopy 

SpaRSA 

FISTA 

DALM 

50% 

99.8 

99.5 

99.8 

97.6 

96.2 

99.5 

60% 

98.6 

96.6 

98.7 

90.5 

86.8 

96.2 

70% 

84.1 

76.3 

84.6 

63.3 

58.7 

78.8 

close  to  Homotopy,  achieving  the  second  best  overall  accuracy.  On  the  other 
hand,  FISTA  obtains  the  lowest  recognition  rates,  followed  by  SpaRSA. 

In  terms  of  speed,  Homotopy  is  also  one  of  the  fastest  algorithm.  In  particu¬ 
lar,  when  the  pixel  corruption  percentage  is  small,  the  sparsity  of  the  coefficient 
vector  w  =  [a:;  e]  is  low.  As  Homotopy  iteratively  adds  or  removes  nonzero 
coefficients  one  at  a  time,  the  algorithm  can  quickly  terminate  after  just  a  few 
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Fig.  9:  Average  run  time  (in  second)  on  the  Multi-PIE  database. 


iterations.  On  the  other  hand,  as  w  becomes  dense  when  the  pixel  corruption 
percentage  increases,  the  complexity  of  Homotopy  increases  superlinearly  and 
becomes  inefficient.  It  is  also  important  to  note  that  the  speed  of  the  two  accel¬ 
erated  iterative-thresholding  algorithms,  namely,  FISTA  and  DALM,  does  not 
increase  significantly  as  the  sparsity  of  w  increases. 

It  is  more  interesting  to  separately  compare  PDIPA,  L1LS,  and  Homotopy, 
which  provably  solve  the  (Pi)  problem,  and  SpaRSA,  FISTA,  and  DALM,  which 
essentially  solve  a  relaxed  version  of  (Pi).  Overall,  PDIPA,  L1LS  and  Homotopy 
perform  similarly  in  terms  of  the  accuracy,  consistent  with  our  previous  obser¬ 
vations  in  the  simulation.  However,  L1LS  is  significantly  more  expensive  from  a 
computation  point  of  view  to  achieve  the  same  recognition  rates  as  Homotopy. 
Among  the  three  iterative  soft-thresholding  methods,  the  experiments  suggest 
that  DALM  is  the  most  accurate,  as  shown  in  Figure  8. 

5  Face  Recognition  Experiment  II:  Face  Alignment 

In  this  section,  we  benchmark  the  performance  of  the  ^  -min  algorithms  in  ro¬ 
bust  face  alignment  within  the  sparse  representation  framework.  While  the 
previous  section  has  demonstrated  that  solving  the  CAB  model  achieves  state- 
of-the-art  recognition  accuracy  on  public  datasets  even  when  the  test  face  images 
are  severely  corrupted  or  occluded,  its  success  still  relies  on  the  assumption  that 
the  test  images  are  well  aligned  with  the  training  images.  However,  this  assump¬ 
tion  is  not  always  satisfied  in  practice.  As  illustrated  in  Figure  10  top,  a  test 
face  image  obtained  from  an  off-the-shelf  face  detector  is  likely  to  have  some 
registration  error  against  the  training  images.  The  registration  error  may  con¬ 
tribute  to  erroneous  linear  representation  of  the  query  image,  even  if  sufficient 
illuminations  are  present  in  the  training  set.  Fortunately,  this  alignment  issue 
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Fig.  10:  Effect  of  registration.  The  task  is  to  identify  the  girl  among  20 
subjects,  by  computing  the  sparse  representation  of  her  input  face  with 
respect  to  the  entire  training  set.  The  absolute  sum  of  the  coefficients 
associated  with  each  subject  is  plotted  on  the  right.  We  also  show 
the  faces  reconstructed  with  each  subject’s  training  images  weighted 
by  the  associated  sparse  coefficients.  The  red  line  (cross)  corresponds 
to  her  true  identity,  subject  12.  Top:  The  face  region  is  extracted  by 
Viola  and  Jones’  face  detector  (the  black  box).  Bottom:  Informative 
representation  obtained  by  using  a  well-aligned  face  region  (the  white 
box).  (Courtesy  of  [46]). 


can  be  naturally  addressed  within  the  sparse  representation  framework.  It  has 
been  shown  in  [46]  that  the  face  alignment  problem  can  be  solved  by  a  series  of 
linear  problems  that  iteratively  minimize  the  sparsity  of  the  registration  error, 
which  leads  to  an  effective  algorithm  that  works  for  face  images  under  a  large 
range  of  variations  in  scale  change,  2-D  translation  and  rotation,  and  different 
3-D  poses.  Furthermore,  it  inherits  the  robustness  property  from  the  sparse 
representation  framework,  and  hence  is  also  able  to  deal  with  faces  that  are 
partially  occluded. 

Recall  that  a  well-aligned  test  image  b0  £  can  be  represented  as  a  sparse 
linear  combination  Ax o  of  all  of  the  images  in  the  database,  plus  a  sparse  error 
eo  due  to  occlusion:  6o  =  Ax o  +  eo .  Now  suppose  that  bo  is  subject  to  some 
misalignment,  so  instead  of  observing  bo,  we  observe  the  warped  image 

b  =  bQ  o  t~  1  (62) 

for  some  transformation  r  £  T,  where  T  is  a  finite-dimensional  group  of  trans¬ 
formations  acting  on  the  image  domain.  As  seen  in  Figure  10,  directly  seeking  a 
sparse  representation  of  b  against  the  (well-aligned)  training  images  is  no  longer 
appropriate.  However,  if  the  true  deformation  r_1  can  be  efficiently  found,  then 
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Fig.  11:  Alignment  experiment  results,  (a)  Examples  of  perturbation.  The  green 
bounding  boxes  indicate  the  initial  face  location  with  some  registration 
error;  the  red  bounding  boxes  show  the  final  position  after  face  align¬ 
ment.  (b)  Average  pixel  error  (lower  the  better),  (c)  Rate  of  Success 
(higher  the  better),  (d)  Average  run  time  (lower  the  better). 


we  can  apply  its  inverse  r  to  the  test  image,  and  it  again  becomes  possible  to 
find  a  good  sparse  representation  of  the  resulting  image: 

6ot  =  Ax  +  e.  (63) 

Naturally,  we  would  like  to  use  the  sparsity  as  a  cue  for  finding  the  correct 
deformation  r.  This  can  be  formulated  as  the  following  optimization  problem: 

min  ||a;||i  +  ||e||i  subj  to  bor  =  Ax  +  e.  (64) 

£c,e,rGT 

However,  this  is  a  difficult  nonconvex  optimization  problem.  Furthermore,  due 
to  the  concern  of  local  minima,  directly  solving  (64)  may  simultaneously  align 
the  test  image  to  different  subjects  in  the  database.  Therefore,  it  is  more  ap¬ 
propriate  to  seek  the  best  alignment  with  each  subject  i  in  the  database  [46]: 

Ti  =  arg  min  ||e||i  subj  to  b  o  n  =  AiX  +  e.  (65) 

£c,e,Ti£T 

In  (65),  || a3|| i  is  not  penalized,  since  Ai  £  only  contains  the  images  of 

subject  i  and  x  is  no  longer  expected  to  be  sparse.  While  (65)  is  still  nonconvex, 
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when  a  good  initial  estimation  for  the  transformation  is  available,  e.g.,  from  the 
output  of  a  face  detector,  one  can  refine  this  initialization  to  an  estimate  of  the 
true  transformation  by  repeatedly  linearizing  about  the  the  current  estimate  of 
Tj.  This  idea  leads  to  the  final  problem  formulation  as  a  convex  optimization 
problem  as  follows: 

min  ||e||i  subj  to  bo  n  +  JiAn  =  Atx  +  e.  (66) 

x,e,A  Ti£T 

Here,  J;  =  b  or,  £  R.dxqi  is  the  Jacobian  of  bon  with  respect  to  the 

transformation  parameters  n,  and  A n  is  the  update  direction  w.r.t.  n- 

(k) 

During  each  iteration  k,  the  current  alignment  parameters  t-  ;  correct  the 
observation  as  b ^  =  b  o  r^k\  Denote  =  [Aj,  —  J^]  €.  Rdx(ru+'2'd  and 

w  =  [xT ,  A rf]T,  then  the  update  A n  can  be  computed  by  solving  the  following 
problem: 

min  lie |h  subj  to  b^  =  B^w  +  e.  (67) 

uu.e 

Interested  readers  are  referred  to  [46]  for  more  details  about  the  algorithm  and 
comprehensive  experimental  studies. 

In  this  section,  we  benchmark  the  various  G-min  algorithms  on  the  CMU 
Multi-PIE  face  database.  In  order  to  solve  (67),  we  had  to  modify  the  code 
for  each  algorithm  so  that  ||a:||i  is  no  longer  penalized  and  to  take  care  of  the 
new  constraint.  To  further  speed  up  the  algorithms,  we  take  advantage  of  the 
fact  that  the  matrix  B  is  usually  a  tall  matrix  with  much  fewer  columns  than 
rows.  For  our  experiments,  the  training  set  contains  rii  =  7  images  per  subject, 
and  we  choose  the  transformation  group  T  to  be  the  set  of  similarity  transforms 
(therefore  <j,  =  4).  So,  the  number  of  columns  in  B  is  just  rq  +  qi  =  11,  while 
the  number  of  rows  is  equal  to  the  number  of  pixels  in  each  image. 

Firstly,  we  modify  PDIPA  to  obtain  a  faster  algorithm  for  this  problem. 
We  recall  that  PDIPA  is  computationally  expensive  mainly  because  at  each 
iteration,  we  need  to  solve  a  linear  system  of  equations  the  size  of  the  dictionary 
[B,I]  for  (67).  For  instance,  if  d  =  40  x  30  =  1200,  it  results  in  inverting  a 
matrix  of  size  larger  than  1200  x  1200.  However,  if  we  know  that  B  has  very  few 
columns,  it  is  not  efficient  to  consider  the  dictionary  [5,/]  as  a  whole.  Instead, 
we  can  explicitly  write  down  the  closed-form  solution  to  the  large  linear  system 
with  B  and  /  considered  separately.  This  leads  to  a  much  smaller  problem  that 
only  requires  inverting  a  (rq  +  qi)  x  (n,;  +  q,)  =  11  x  11  matrix.  Hence,  the 
complexity  of  PDIPA  is  significantly  reduced.  We  note  that  similar  ideas  can 
be  also  applied  to  speed  up  the  other  algorithms.  In  particular,  when  matrix  B 
is  typically  an  overdetermined  matrix,  the  ALM  algorithm  that  operates  in  the 
primal  space,  namely,  PALM,  is  more  efficient  than  DALM  in  the  dual  space.  In 
the  Appendix,  modifications  for  three  f^-min  algorithms  are  briefly  explained, 
which  solve  (67)  based  on  GP,  Homotopy,  and  1ST  approaches,  respectively. 

The  performance  of  the  six  G-iriin  algorithms  is  again  evaluated  using  the 
face  images  from  the  Multi-PIE  database.  In  this  experiment,  the  249  subjects 
from  the  Session  1  are  used.  Out  of  20  illuminations,  {0, 1,  7, 13, 14, 16, 18}  are 
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chosen  as  the  training  images  and  the  illumination  10  is  used  as  the  query  im¬ 
age.  All  images  are  cropped  and  down-sampled  to  40  x  30  pixels.  Furthermore, 
the  test  images  are  manually  perturbed  up  to  8  pixels  along  the  x  and  y- axes 
in  the  canonical  frame  (with  size  40  x  30)  or  up  to  45°  in-plane  degree,  respec¬ 
tively.  Moreover,  to  test  the  robustness  of  the  f^-min  algorithms  to  occlusion, 
we  replace  a  random  located  block  of  size  10%  of  the  face  image  with  the  image 
of  a  baboon.  See  Figure  11(a)  for  an  example. 

The  accuracy  of  each  algorithm  is  measured  by  both  the  average  pixel  error 
and  the  rate  of  success.  We  consider  an  alignment  successful  if  the  difference 
between  the  final  alignment  is  within  3  pixels  of  the  ground  truth  in  the  origi¬ 
nal  image  frame  (640  x  480  image  size).  As  shown  in  Figure  11(b)  and  (c),  in 
terms  of  alignment  accuracy,  PDIPA,  PALM  and  L1LS  achieve  the  best  perfor¬ 
mance,  with  Homotopy  and  FISTA  slightly  worse.  Meanwhile,  the  performance 
of  SpaRSA  degrades  much  faster  than  the  others  as  the  perturbation  level  in¬ 
creases.  On  the  other  hand,  Figure  11(d)  shows  that  L1LS,  SpaRSA,  and  FISTA 
are  more  computational  expensive.  Overall,  PALM  is  the  fastest  algorithm, 
which  takes  25%  to  50%  less  time  compared  to  PDIPA  and  Homotopy. 

In  conclusion,  Both  PALM  and  PDIPA  performs  very  well  for  the  alignment 
problem.  Taking  into  account  the  implementation  complexity  and  the  fact  that 
PALM  is  slightly  faster  than  PDIPA,  we  conclude  that  PALM  is  the  best  choice 
for  this  problem. 

6  Conclusion  and  Discussion 

We  have  provided  a  comprehensive  review  of  five  fast  +  -min  methods,  i.e., 
gradient  projection,  homotopy,  soft  shrinkage- thresholding,  proximal  gradient, 
and  augmented  Lagrange  multiplier.  Through  extensive  experiments,  we  have 
shown  that  under  a  wide  range  of  data  conditions,  there  is  no  clear  winner  that 
always  achieves  the  best  performance  in  terms  of  both  speed  and  accuracy.  For 
perfect,  noise-free  data,  the  primal-dual  interior  point  method  (PDIPA)  is  more 
accurate  than  the  other  algorithms,  albeit  at  a  much  lower  speed.  For  noisy  data, 
first-order  techniques  (i.e.,  SpaRSA,  FISTA,  and  ALM)  are  more  efficient 

in  recovering  sparse  signals  in  both  the  low-sparsity  ratio  and  high-sparsity  ratio 
regimes.  We  have  also  considered  special  cases  of  the  t'l-min  problem  where  it 
is  applied  to  robust  face  recognition.  In  this  application,  Homotopy  and  ALM 
achieve  the  highest  recognition  rates,  and  their  computational  cost  is  also  the 
lowest  compared  to  the  other  +  -min  algorithms. 

All  experiments  described  in  this  paper  were  carried  out  in  MATLAB.  The 
benchmarking  in  terms  of  computational  speed  may  vary  significantly  when  the 
algorithms  are  implemented  in  other  programming  languages  such  as  C/C++. 
The  accuracy  of  different  algorithms  may  also  be  affected  by  the  precision  of 
different  floating-point /fixed-point  arithmetic  logic.  To  aid  in  peer  evalution, 
all  the  experimental  scripts  and  data  have  been  made  available  on  our  website. 
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7  Appendix 

In  this  section,  we  briefly  discuss  how  to  more  efficiently  solve  the  objective 
function  (67)  by  taking  advantage  of  the  intrinsic  structure  of  its  dictionary 
[B,I\.  For  clarity,  we  will  rewrite  (67)  using  the  following  notation: 

min  || e|| i  subj  to  b  =  Bw  +  e.  (68) 

w,e 

7.1  Solving  Face  Alignment  via  Gradient  Projection 
Methods 

The  objective  function  (68)  can  be  optimized  relatively  straightforward  follow¬ 
ing  the  gradient  projection  approach.  In  particular,  using  truncated  Newton 
interior-point  method,  the  objective  function  is  rewritten  as: 

min  \\\b  -  Bw  -  e||2  +  A  J2i=i  ui  (69) 

subj.  to  —  Ui  <  ei  <  Ui,  i  =  1,  •  •  •  , d.  (70) 

Then  a  logarithmic  barrier  to  bound  the  domain  of  —it,;  <  e*  <  can  be 
constructed  as: 


5>(e,  u)  =  -  ^2  lo§ K  +  ei)  -  ^2  lo§(ui  -  ei)-  (71) 

Over  the  domain  of  ( w,e,u ),  the  central  path  is  calculated  by  minimizing 
the  following  convex  function 

1  ^  ^ 

Ft(w,e,u )  =  t(-\\b-  Bw  -  e||2  +  A^u*)  +  $(e,u),  (72) 

i—l 

where  t  is  a  nonnegative  parameter.  Then,  the  update  direction  (Aw,  Ae,  Am) 
is  solved  using  Newton’s  method: 


Aw 
Ae 
A  u 


V2Ft(w,e,u)  • 


NFt(w,e,u). 


(73) 
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7.2  Solving  Face  Alignment  via  Homotopy  Methods 

In  the  Homotopy  formulation,  define  a  composite  objective  function: 

F{w,e)  =  ^||b  —  Bw  —  e||2  +  A||e||i  (74) 

=  f(w,e)  +  Xg(e).  (75) 

Setting  the  subgradient  OF  =  0  —df  =  A dg,  we  obtain  the  following  equali¬ 
ties: 

d 

A  — — g(e)  =  0  =  BT  (b  —  Bw  —  e)  (76) 

and 

c(e)  =  A  J^(e)  =  (b-  Bw-  e).  (77) 

In  initialization,  set  =  0  and  w  =  B^b.  Since  we  know  <9||e||i  £  [—1, 1], 
we  initialize  the  maximal  A  using  (77): 

A0  =  max(abs(c)),  (78) 

and  the  support  Z={i:|c»|  =  A}. 

At  each  iteration  k,  first  assuming  w  is  fixed,  the  update  direction  for  e  is 

d^(T)  =  -A-sgn(c«(T)),  (79) 

and  the  direction  for  the  coefficients  that  are  not  in  the  support  X  is  manually 
set  to  zero.  Hence, 

e(k+i)  =  £(k)  +7dW_  (80) 

In  (80),  a  proper  step  size  value  7  remains  to  be  determined.  In  Homotopy 
methods,  7  is  set  as  the  minimal  step  size  that  leads  to  either  a  coefficient  of  c 
that  is  in  the  support  X  crosses  zero,  and  hence  should  be  removed  from  X;  or 
the  magnitude  of  a  coefficient  of  c  outside  the  support  X  reaches  or  exceed  A, 
and  hence  should  be  added  to  X. 

The  first  situation  is  easy  to  evaluate: 

7_  =  min{— ej/dj}.  (81) 

However,  choosing  7+  for  the  second  situation  presents  a  problem,  as  in  (77) 
any  update  in  e  does  not  change  the  values  of  c  outside  the  support.  In  other 
words,  due  to  the  fact  that  the  dictionary  for  e  where  its  coefficients  are  sparse 
is  indeed  an  identity  matrix,  the  original  strategy  in  Homotopy  could  not  add 
any  new  indices  into  the  support  of  e.  An  alternative  solution  that  we  use  in  our 
implementation  is  to  determine  7  and  w  separately  using  the  first  constraint 
(76),  and  then  iterate. 
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7.3  Solving  Face  Alignment  via  Iterative  Soft-Thresholding 

The  composite  objective  function  F(w,  e)  =  f(w ,  e)  +  A g(e)  in  (75)  can  be  also 
solved  by  iterative  soft-thresholding  (1ST)  methods.  To  simplify  the  notation, 
denote  y  =  [in;  e]  £  Rra+d.  Then  the  objective  function  becomes: 

F(y)  =  ^\\b-[B,I]yf  +  X\\e\\1.  (82) 

Then  the  update  rule  to  minimize  F{y)  is  computed  by  substituting  a  linearized 
approximation  of  /  as  shown  below: 

rclyt'k+1')  =  argmin{(y  —  y<-fc^)TV/(j/*'fc'))  + 

v 

\\\v  ~  y(fe)l|2  '  V2/(y(fc))  +  Xg(e)}.  (83) 

We  further  introduce  an  intermediate  variable 

U{k)  =  y(k)  _  _j_V/(y(fe)^  (84) 


where  the  hessian  V2/(i/fe))  is  approximated  by  the  diagonal  matrix  a^I. 
Then  based  on  the  1ST  approach,  (83)  has  a  closed-form  solution  as  follows: 
For  i  <  n, 

(yi-u{k))'2\  _.(*0. 

Li  ^  - 

Vi 


(fc+1)  (fc+1) 

y\  =  w\  —  arg  mm  ■ 


=  U} 


(85) 


and  for  i  >  n, 


Ak+ 1) 


=  arg  mmyi 


(Vi-uf')2  ,  A  |  yd 

2  T  a(k) 


=  soft 


(u{k) 

\Ui  ’  aW  J  ■ 


(86) 
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